Abstract

We introduce a corpus of 7,032 sentences^1 rated by human annotators for formality, informativeness, and implicature on a 1-7 scale. The corpus was annotated using Amazon Mechanical Turk.^2 Reliability in the obtained judgments was examined by comparing mean ratings across two MTurk experiments, and correlation with pilot annotations (on sentence formality) conducted in a more controlled setting. Despite the subjectivity and inherent difficulty of the annotation task, correlations between mean ratings were quite encouraging, especially on formality and informativeness. We further explored correlation between the three linguistic variables, genre-wise variation of ratings and correlations within genres, compatibility with automatic stylistic scoring, and sentential make-up of a document in terms of style. To date, our corpus is the largest sentence-level annotated corpus released for formality, informativeness, and implicature.

1 Introduction 

Consider the following two utterances:^3

1. This is to inform you that your book has been rejected by our publishing company as it was not up to the required standard. In case you would like us to reconsider it, we would suggest that you go over it and make some necessary changes.

2. You know that book I wrote? Well, the publishing company rejected it. They thought it was awful. But hey, I did the best I could, and I think it was great. I'm not gonna redo it the way they said I should.

^1 A more recent (Aug 30, 2016) version of this paper appears at http://web.eecs.umich.edu/~lahiri/new_draft.pdf.

^2 https://www.mturk.com/mturk/welcome.

^3 Courtesy: http://www.word-mart.com/html/formal_and_informal_writing.html

Not only are the styles of the two utterances different (the first is formal, the second informal), but they are also targeted at different people. This dichotomy of (in)formal expressions was examined in great detail by Heylighen and Dewaele (1999). As they observed, formality is the most important dimension of writing style (cf. (Biber, 1988; Hudson, 1994)),^4 and has close connections to informativeness and implicature. They argued, in particular, that formality emerges out of a communicative objective - to maximize the amount of information being conveyed to the listener while at the same time maintaining (or at least appearing to maintain) Grice’s communicative maxims of Quality, Quantity, Relevance and Manner as much as possible (Grice, 1975).

Heylighen and Dewaele introduced the notion of deep formality - “avoidance of ambiguity by minimizing the context-dependence and fuzziness of expression”, and reasoned that the other type of formality (surface formality, formalizing language for stylistic effects) is a corruption of the language’s original deep purpose. Deep formality was characterized by a lack of contextuality, evidenced in particular by decreased levels of deixis and implicature in linguistic realizations.

^4 For a general discussion on the theory of registers, see (Levelt, 1989) and (Leckie-Tarry and Birch, 1995).

While several of the arguments Heylighen and Dewaele made are open to question, an important take-home message from their theory is a so-called continuum of formality, arising out of a process where a document (or a piece of text) can be “formalized” ad infinitum, simply by adding more and more context. This precludes us from labeling a document or a sentence binarily as “formal” or “informal”. We will instead follow the Likert scale approach (Likert, 1932) to sentence formality annotation, shown to work well by Lahiri and Lu (2011). In some sense, our work is similar to the Stanford politeness corpus (Danescu-Niculescu-Mizil et al., 2013); both corpora are at the sentence/utterance level, and both measure a pragmatic variable on an ordinal scale (formality vs politeness).

2 Background and Related Work 
2.1 Formality 

Heylighen and Dewaele’s study, while seminal in the field of formality scoring, had its limitations. Although they stressed the relationship between contextuality (missing information) and implicature, it was never quantified. They also refrained from quantifying implicature itself - to “avoid all intricacies at the level of phonetics, syntax, semantics and pragmatics”, citing that the “recognition of phonetic patterns, syntactical parsing, and even more semantic and pragmatic interpretation of natural language are still extremely difficult... to perform automatically.” Further, we suspect that the relation between deep formality and implicature might have been over-emphasized (cf. Section 4.2).

In the end, they quantified formality using deixis only (percentage difference between deictic and non-deictic parts-of-speech), which we will henceforth refer to as the “F-score”.^5 F-score was used in genre analysis by Nowson et al. (2005), and shown to be quite effective in discriminating between the 17 genres used in their study. Further, systematic variation in F-score was observed across gender and personality traits. Teddiman (2009) noted in particular that F-score can successfully differentiate between genres, but it cannot explain why the genres are different. F-score was found to be the same for diary entries, and comments on those entries.^6 In follow-up work, Li et al. (2013) proposed a version of F-score (called “CF-score”) based on Coh-Metrix (Graesser et al., 2004) dimensions of narrativity, referential and deep cohesion, syntactic simplicity and word concreteness. CF-score was better able to discriminate between genres than F-score.

^5 Not to be confused with the harmonic mean of precision and recall.

In a separate strand of work, Brooke and Hirst (2014) identified formality as a continuous lexical attribute, and assigned a formality score to a word based on its co-occurrence frequency with a hand-picked seed set of formal and informal words, smoothed by latent semantic analysis (Brooke et al., 2010). Formality of words was further shown to be correlated with other stylistic dimensions such as concreteness and subjectivity (Brooke and Hirst, 2013).

While all the above studies are very important, they looked at formality from document and word levels, not from the sentence level. Abu Sheikha and Inkpen (2012) equated formality of a sentence with the formality of its corresponding document, and Brooke and Hirst (2014) predicted formality of sentences using word-level features. Peterson et al. (2011) and Machili (2014) looked into formality of emails at workplace, the former exploring the Enron corpus and how formality varies with social distance, relative power, and the weight of imposition, and the latter conducting similar analyses among workplace emails from Greek multinational companies.

As Lahiri et al. (2011) showed in their work, sentence formality is not the same as document formality. While it is true that sentences do follow document-level trends, it was observed that there is a wide spread among sentences in terms of formality - not all sentences from a document are equally formal (cf. (Lahiri and Lu, 2011), and Section 4.3 of this paper). Lahiri and Lu (2011) further showed that there are cases where the words in a sentence are formal, but the sentence as a whole is not (“For all the stars in the sky, I do not care.”) - thus raising questions regarding a straightforward application of lexical formality to explain sentence formality.^7

^6 This could be due to linguistic style co-ordination (Danescu-Niculescu-Mizil, 2012).

The only two studies we are aware of that looked into formality annotation of sentences are (Lahiri and Lu, 2011) and (Dethlefs et al., 2014). Lahiri and Lu had 600 sentences annotated by two undergraduate linguistics students on a Likert scale of 1-5. Inter-rater agreement was shown to improve substantially from binary annotations, which could be attributed to the continuum of formality phenomenon described in Section 1. Dethlefs et al., on the other hand, were interested in formality from a natural language generation (NLG) perspective.^8 They annotated utterances using Amazon Mechanical Turk on three dimensions of style - colloquialism (opposite of formality), politeness, and naturalness. A 1-5 Likert scale was used. The problem with this study is that the number of annotated sentences was quite limited, and they came from a restricted class of documents concerning restaurant reviews in a single city. This makes Dethlefs et al.’s corpus unsuitable for our purpose. We wanted a generic corpus of sentences annotated with formality ratings that could help build a sentence formality predictor, so we extended the work of Lahiri and Lu (2011) instead.

2.2 Implicature 

A second issue with Heylighen and Dewaele’s F-score is that it is unreliable on small documents, such as sentences and utterances (cf. (Lahiri et al., 2011)). It is therefore of interest to examine if the F-score correlates with the human notion of formality at sentence level (cf. Section 4.2). But perhaps even more importantly, it shows a big limitation in the formulation of F-score: it is based on deixis only, and fails to take into account the amount of implicature present in a sentence.

Note that in general, it is true that as we add more context to a document (or a sentence), it tends to become longer. The opposite is also true: as we rob a document (or sentence) of context, it tends to become shorter (contextual). So it could be reasoned that sentences by themselves have a lot of unstated context (as compared to a document), which is resolved by looking at neighboring sentences.^9 So if we could somehow estimate the amount of “missing” context in a sentence, we would be one more step ahead in assessing its true formality.

^7 Also see the examples given by Potts (2012).

^8 Note that the importance of formality in language generation has long been recognized (Hovy, 1990; Abu Sheikha and Inkpen, 2011).

Quantifying the missing context is complicated by the fact that it depends on both deixis and implicature. While F-score gives a reasonable estimate of the amount of relative deixis present in a sentence, it does not give any estimate of the amount of implicature. This forced us to rate sentences for the amount of implicature they carry (on a Likert scale, because implicature is a continuous attribute (Degen, 2015)). This annotation process not only gave us implicature ratings, but also allowed us to look into how subjective the concept of implicature is (cf. Section 3.2).

Note that Degen (2015) had already conducted a similar study on implicature annotation using Mechanical Turk. However, the focus of her study was on one particular type of implicature (some but not all), and the annotation process was not tied to formality or any other stylistic attribute. Also to be noted is the fact that our annotated corpus of 7,032 sentences is much larger than Degen’s corpus of 1,363 utterances.

A general discussion of the vast literature on implicature (starting with Grice (1975), and expanded by Harnish (1976), among others) is beyond the scope of this paper. Interested readers are referred to the excellent book by Potts (2005) for a gentle introduction to the theory of conventional implicatures (CIs), and to (Levin and Prince, 1986; Benotti, 2010; Benotti and Blackburn, 2011) for a discussion on causal implicatures. Grice also introduced scalar implicatures - arguably the most prominent class of implicatures - that equate “some” with “not all” for the sake of politeness. Papafragou and Musolino (2003) discussed the acquisition of scalar implicatures by children, and Carston (1998) related scalar implicatures with relevance and informativeness - a topic we will briefly visit in the next section.

Apart from Degen (2015), we are not aware of any work that specifically looked into implicature rating at sentence/utterance level. Degen’s work, as we already pointed out, is not tied to formality scoring, so we used our own dataset of 7,032 sentences to rate for both formality and implicature.

^9 Much like resolving the meaning of a word by looking at neighboring words.

2.3 Informativeness 

We also rated sentences for informativeness - a trait Heylighen and Dewaele (1999) identified with deep formality, where language is formalized to communicate meaning more clearly and directly. We will test this hypothesis by checking if the formality of a sentence positively correlates with its informativeness (Section 4.2). Interestingly, Carston (1998) independently arrived at a similar conclusion: “informativeness principles... give rise to... a strengthening or narrowing down of the encoded meaning of the utterance.” While Carston’s specific argument was tied to scalar implicatures, it is not very far-fetched to see that the same argument would, in effect, also apply to deep formality as evinced by Heylighen and Dewaele.

It is to be noted that the word informativeness has different connotations in different settings. In the machine translation community, for example, the word informativeness denotes a type of fidelity measure to be applied to the translated text - in order to verify how much content of the original text is preserved under the translation (Rajman and Hartley, 2001). Informativeness of words and phrases is an important parameter in problems ranging from named entity detection (Rennie and Jaakkola, 2005) to keyword extraction (Timonen et al., 2012). Under this setting, informativeness is known as term informativeness (Kireyev, 2009; Wu and Giles, 2013). Interestingly, Rennie and Jaakkola (2005) pointed out that their term informativeness estimation approach would be especially helpful in “extracting information from informal, written communication” (emphasis ours).

While all the above studies are important in their own right, and ground-breaking in some cases, we found none that specifically looked into informativeness rating of sentences in the context of formality, and there is no publicly available annotated dataset for sentence informativeness. In this work, we will bridge the gap.


3 Corpus Creation 

3.1 Data 

Our data comes from the pioneering study of Lahiri et al. (2011). They compiled four different datasets - blog posts, news articles, academic papers, and online forum threads - each consisting of 100 documents. For the blog dataset, they collected the most recent posts from the top 100 blogs listed by Technorati^10 on October 31, 2009. For the news article dataset, they collected 100 news articles from 20 news sites (five from each). The articles were mostly from “Breaking News”, “Recent News”, and “Local News” categories, with no specific preference attached to any particular category.^11 For the academic paper dataset, they randomly sampled 100 papers from the CiteSeerX^12 digital library. For the online forum dataset, they sampled 50 random documents crawled from the Ubuntu Forums,^13 and 50 random documents crawled from the TripAdvisor New York forum.^14 The blog, news, paper, and forum datasets had 2,110, 3,009, 161,406, and 2,569 sentences respectively.

We manually cleaned and sentence-segmented the blog, news, and forum datasets to come up with 7,032 unique sentences. The much larger and more complex paper dataset was discarded, because manual cleansing and sentence segmentation of text data extracted from PDF was prohibitively time-consuming, and often unsuccessful because of spurious characters, words, and corrupted/missing segments of text.^15

3.2 Annotation 

With the 7,032 sentences, we conducted two Mechanical Turk annotation experiments. In our first

^10 http://technorati.com/.

^11 The news sites were CNN, CBS News, ABC News, Reuters, BBC News Online, New York Times, Los Angeles Times, The Guardian (U.K.), Voice of America, Boston Globe, Chicago Tribune, San Francisco Chronicle, Times Online (U.K.), news.com.au, Xinhua, The Times of India, Seattle Post Intelligencer, Daily Mail, and Bloomberg L.P.

^12 http://citeseerx.ist.psu.edu/.

^13 http://ubuntuforums.org/.

^14 http://www.tripadvisor.com/ShowForum-g60763-i5-New_York_City_New_York.html.

^15 Note that this manual cleaning was necessary for our annotation process, because we cannot expect our annotators to deal with corrupt/incomplete/inaccurate sentences.




                   Overall   Blog   News   Forum
Formality          0.68      0.60   0.35   0.48
Informativeness    0.64      0.63   0.42   0.63
Implicature        0.14      0.19   0.09   0.11

Table 1: Spearman’s ρ between the mean ratings obtained from our two Mechanical Turk experiments. All results are statistically significantly different from zero, with p-value < 0.0001.



                      Overall   Blog   News    Forum
MTurk Experiment 1    0.78      0.73   0.32*   0.49
MTurk Experiment 2    0.73      0.61   0.30*   0.53

Table 2: Spearman’s ρ between the mean formality ratings from Mechanical Turk, and mean formality ratings from Lahiri and Lu (2011). All results are statistically significantly different from zero, with p-value < 0.0001. For the results marked with a *, their p-values are < 0.01.


experiment, Turkers were requested to rate sentences on a 1-7 scale for formality, informativeness, and implicature. Each sentence was a HIT (Human Intelligence Task), and we requested five assignments per HIT so that we could get five independent ratings for each sentence. We requested Turkers with English as first language in our HIT title^16 and description,^17 but there was no easy way to ensure that it was indeed the case. As a quick fix, we required “Turkers from US” as qualification, and hoped that the average across five independent ratings would paint a better picture than any individual rating alone. Our instructions were minimal - we started with the two examples given at the beginning of Section 1 to prime the Turkers with the notion of formality, and gave them a few more links to explore the concept on their own.^18 Then we told them to rate sentences on how formal they are. Turkers were requested to be consistent in their ratings across sentences, and rate sentences independently of each other. The order of presentation of the sentences was scrambled so as to remove any potential sequence effect. In total, 527 Turkers participated in our first experiment.

^16 How formal is this sentence? English as first language required.

^17 This is a formality survey HIT, where we have three stylistic questions on an English sentence. Please do not enter if you do not have English as first language.

^18 http://www.engvid.com/english-resource/formal-informal-english/, http://dictionary.cambridge.org/us/grammar/british-grammar/formal-and-informal-language, http://www.englishspark.com/informal-language/, http://www.antimoon.com/how/formal-informal-english.htm.

Note, however, that assessing inter-rater agreement becomes difficult on Mechanical Turk because different Turkers work on different numbers of HITs. Furthermore, we had no quality control other than “US-based” in our first experiment. This is why we conducted a second experiment, which was essentially identical to the first, except that now we added two more requirements - at least 1,000 HITs completed with at least 99% approval rate - on top of the US-based requirement. This resulted in 187 Turkers participating in our second experiment.

Correlations between the mean ratings obtained from these two experiments are shown in Table 1. Several things are to be noted from this table. First, note that even without quality control (and weak enforcement of the English-first-language policy), Turkers’ mean ratings correlated pretty well (across two experiments) for both formality and informativeness, echoing previous findings by Lahiri and Lu (2011). Second, it shows that even without extensive and detailed instructions, Turkers were able to rate subjective concepts like “formality” and “informativeness” quite well, again echoing the findings summarized by Lahiri and Lu. Note that we did not provide Turkers with extensive and detailed instructions because:






Formality
  High: And in its middle-class neighborhoods, Baghdad is a city of surprising topiary sculptures: leafy ficus trees are carved in geometric spirals, balls, arches and squares, as if to impose order on a chaotic sprawl.
  Low: Thanx!

Informativeness
  High: According to the Shanghai Jiao Tong University Press, the press is currently compiling a picture album of Qian and a collection of his writings based on 800-plus-page documents retrieved from the U.S. National Archives, which include details about his encounters with the U.S. government and his trip back home.
  Low: Any recommendations?

Implicature
  High: Who will join?
  Low: Most mornings they rise before their rooster crows, bolting down a meager breakfast of coconut and chile-spiced vegetables over rice before venturing out on their journey: rowing to school aboard a hand-carved 15-foot sampan.

Table 3: Example sentences with high and low mean MTurk ratings for formality, informativeness, and implicature.


• We did not want to bias them with our view of the English language (removing experimenter bias).

• We wanted to see if Likert scale annotations were good enough (as claimed by Lahiri and Lu (2011)) to instil sufficient reliability and agreement in the annotation process, especially between mean ratings.

• We wanted to see if mean ratings across multiple raters could effectively eliminate the idiosyncrasies of individual Turkers in a subjective annotation task like this.^19

Having said that, note from Table 1 that the correlation values for implicature are rather low across all genres (albeit positive). This is unsurprising, however, given that implicature is arguably the most subjective among the three pragmatic variables we investigated, and quite possibly, the least amenable to any straightforward syntactic, lexical, or semantic explanation.
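The comparison behind Table 1 can be illustrated with a minimal sketch, assuming each experiment yields five ratings per sentence keyed by a sentence identifier. The function names, the toy ratings, and the choice to implement Spearman's ρ as a Pearson correlation over tie-averaged ranks are ours for illustration; this is not the script used to produce the reported numbers.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_ratings(ratings_by_sentence):
    """Collapse the per-sentence list of Turker ratings into one mean."""
    return {sid: mean(r) for sid, r in ratings_by_sentence.items()}

# Toy data (hypothetical): five 1-7 ratings per sentence in each experiment.
exp1 = {"s1": [6, 7, 6, 5, 6], "s2": [2, 1, 3, 2, 2],
        "s3": [4, 5, 4, 4, 3], "s4": [1, 1, 2, 1, 1]}
exp2 = {"s1": [7, 6, 6, 6, 7], "s2": [3, 2, 2, 1, 2],
        "s3": [5, 4, 5, 4, 4], "s4": [2, 1, 1, 1, 2]}

m1, m2 = mean_ratings(exp1), mean_ratings(exp2)
ids = sorted(m1)
rho = spearman_rho([m1[i] for i in ids], [m2[i] for i in ids])
print(f"Spearman's rho between experiments: {rho:.2f}")
```

Correlating the means rather than individual ratings sidesteps the problem that different Turkers complete different numbers of HITs, since each sentence contributes exactly one value per experiment.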

^19 Here are the three questions we asked: How formal do you think is the above sentence? How much information do you think the above sentence carries? How much do you think the above sentence implies/suggests, or leaves to possible interpretations? We also had optional comment boxes so that Turkers can leave us their thoughts on the annotation process.


We further compared our mean formality ratings from Mechanical Turk to the mean formality ratings reported by Lahiri and Lu (2011) in their “actual” annotation phase. Results are shown in Table 2. Note that the mean Turker ratings are highly positively correlated with the mean ratings from Lahiri and Lu’s quality-controlled study - except the news genre, where correlations are weaker (also see Table 1). We plan to investigate the news genre in future work. But the overall patterns are strongly encouraging, and validate the idea that a formality-annotated corpus can indeed be built reliably with Likert-scale-style annotations.

We show some example sentences with high and low formality, informativeness, and implicature in Table 3.^20 Note that they follow the usual intuitions about formality, informativeness, and implicature quite well; for example, sentences that are high in formality and informativeness, but low in implicature, are longer and more difficult to read. The opposite is also true; informal and uninformative sentences are much shorter, and are often laden with a lot of implicature.^21 For the rest of the paper, we only consider the mean ratings from our second MTurk experiment, which comprises better-qualified Turkers. For notational convenience, mean ratings will henceforth be referred to as Formality, Informativeness and Implicature, as appropriate.

^20 The full dataset is available at https://drive.google.com/file/d/0B2Mzhc7popBgdXZmRlg2RUdqdDA/view?usp=sharing. Examples in Table 3 are from our second MTurk experiment, which comprises better-qualified Turkers.

^21 Interesting trivia: the title of this paper derives from a sentence in our corpus that is very low in formality and informativeness, and medium in implicature.

[Figure 1 panels: Formality, Informativeness, Implicature]

Figure 1: Genre-wise variation of formality, informativeness, and implicature (can be viewed in grayscale).

4 Experiments

We performed three separate experiments on the 7,032 annotated sentences to identify different aspects of the annotations. In our first experiment, we explored how sentence-level formality, implicature, and informativeness vary across three different online genres - news, blog, and forums (Section 4.1). In the second experiment, we investigated the correlation among these three variables, and correlation with stylistic scores (Section 4.2). Finally, in Section 4.3, we examined how documents varied in terms of sentential formality, informativeness, and implicature - on average.

4.1 Genre-wise Variation

We plot five-bin histograms of formality, informativeness, and implicature in Figure 1. Note from Figure 1 that overall, our corpus is dominated by high-informativeness, mid-to-high-formality, and mid-implicature sentences. Since our implicature rating is less reliable than the other two ratings (cf. Section 3.2), it is relatively unclear whether this mid-implicature trend is a real phenomenon, or is more of a reflection of central tendency bias among the annotators - who, lacking a better choice and a better interpretation, chose middling values for the implicature rating. Central tendency in implicature is also observed for the three individual genres - news, blog, forums.

The news genre is dominated by high-informativeness, and mid-to-high-formality sentences; blogs, too, are mostly high-formality and mid-to-high-informativeness sentences; on the other hand, forums are dominated by mid-to-low-formality sentences, and are spread out almost evenly when it comes to informativeness. The general trends corroborate earlier studies (Lahiri et al., 2011; Lahiri and Lu, 2011).

The fact that forums are spread out in terms of (sentential) informativeness shows that there are all kinds of sentences in forums - some are very informative, some are somewhat informative, and some are uninformative (e.g., help-eliciting sentences such as “help please!”, sentences expressing gratitude such as “Thanks everybody!”, and suggestive sentences such as “give it a shot.”). Filtering forum sentences by informativeness may be a useful first step towards effective mining of forum data.

4.2 Relationship with Others

We experimented with eight different sentential stylistic variables, as detailed below:

1. Fo: Formality of the sentence, i.e., the mean formality rating assigned by Turkers in our second MTurk experiment.

2. In: Informativeness of the sentence, i.e., the mean informativeness rating assigned by Turkers in our second MTurk experiment.

3. Im: Implicature of the sentence, i.e., the mean implicature rating assigned by Turkers in our second MTurk experiment.
Table 4: Spearman’s ρ between stylistic variables, as explained in text. Most of the results are statistically significantly different from zero, with p-value < 0.0001. For the results marked with a *, p-values are < 0.01; for those marked with a **, p-values are < 0.05. Results in italics are statistically insignificant.

Figure 2: Sentential make-up of formality, informativeness, and implicature (can be viewed in grayscale).

4. Lw: Length of the sentence in words.

5. Lc: Length of the sentence in characters.

6. F: Formality score of the sentence, as proposed by Heylighen and Dewaele (1999).

7. I: Informativeness score of the sentence.

8. LD: Lexical density of the sentence (Ure, 1971).

Among these variables, Heylighen and Dewaele’s formality score is given by:

F = (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100) / 2

where the frequencies are taken as percentages with respect to the total number of words in the sentence. The inspiration for this score comes from the fact that nouns, adjectives, prepositions, and articles are found to be non-deictic in word correlation studies, whereas pronouns, verbs, adverbs, and interjections are found to be deictic. F-score measures formality as the amount of relative non-deixis present in a sentence (cf. Section 2.1).
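As a small illustration of the formula, the sketch below computes the F-score from a list of coarse part-of-speech labels, one per word. The label names ("noun", "verb", and so on) and the assumption that a tagger's output has already been mapped onto these eight categories are ours; the mapping from any particular tagset is not specified here.

```python
# Coarse POS categories, following the deictic/non-deictic split in the text.
NON_DEICTIC = ("noun", "adjective", "preposition", "article")
DEICTIC = ("pronoun", "verb", "adverb", "interjection")

def f_score(pos_tags):
    """Heylighen-Dewaele F-score for one sentence.

    pos_tags: one coarse label per word. Labels outside the eight
    categories (e.g. "conjunction") count towards the word total only.
    """
    total = len(pos_tags)
    if total == 0:
        raise ValueError("cannot score an empty sentence")

    def pct(category):
        # Frequency of a category as a percentage of all words.
        return 100.0 * sum(1 for t in pos_tags if t == category) / total

    non_deictic = sum(pct(c) for c in NON_DEICTIC)
    deictic = sum(pct(c) for c in DEICTIC)
    return (non_deictic - deictic + 100.0) / 2.0

# Hypothetical tagging of "The book was rejected".
print(f_score(["article", "noun", "verb", "verb"]))
```

The score ranges from 0 (only deictic categories) to 100 (only non-deictic categories), and on a four-word sentence a single retagged word moves it by 25 points, which gives a feel for why the measure is unstable on short texts.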

Ure’s lexical density takes the form:

LD = (N_lex / N) × 100

where N_lex is the number of lexical tokens (nouns, adjectives, verbs, adverbs) in the sentence, and N is the total number of words in the sentence.
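Lexical density can be sketched in the same style, again assuming one coarse part-of-speech label per word (the label names are illustrative, not tied to a specific tagger).

```python
# Categories counted as lexical tokens in the definition above.
LEXICAL = {"noun", "adjective", "verb", "adverb"}

def lexical_density(pos_tags):
    """Percentage of words in the sentence that are lexical tokens."""
    if not pos_tags:
        raise ValueError("cannot score an empty sentence")
    n_lex = sum(1 for t in pos_tags if t in LEXICAL)
    return 100.0 * n_lex / len(pos_tags)

# Hypothetical tagging of "The book was rejected": 3 lexical tokens of 4.
print(lexical_density(["article", "noun", "verb", "verb"]))
```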

The informativeness score (I) is a scoring formula we propose in this paper. The idea is as follows. Recall from Section 1 that contextuality - the opposite of deep formality - is affected by both deixis and implicature. Although implicature is very hard to quantify, a measure of “ambiguity” in a given piece of text can be formulated by counting how many WordNet senses (Miller, 1995) the words in that text carry on average. The more senses words have, the more ambiguous the text is. The informativeness score (I) of a sentence is thus given by the average number of WordNet senses per word in the sentence.^23
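The averaging step can be sketched as follows. To keep the example self-contained, sense counts come from a small hypothetical dictionary whose numbers are made up for illustration and are not real WordNet values; with NLTK and its WordNet data installed, `len(wordnet.synsets(word))` from `nltk.corpus` would be one way to obtain actual counts. Treating words missing from the lookup as having zero senses is also our assumption.

```python
def informativeness_score(words, sense_counts):
    """Average number of senses per word in a sentence.

    words: the tokens of the sentence.
    sense_counts: mapping from lowercased word to its number of senses.
    Words absent from the mapping contribute zero senses (an assumption).
    """
    if not words:
        raise ValueError("cannot score an empty sentence")
    total_senses = sum(sense_counts.get(w.lower(), 0) for w in words)
    return total_senses / len(words)

# Toy sense counts (hypothetical, not taken from WordNet).
toy_senses = {"any": 2, "recommendations": 3}
print(informativeness_score(["Any", "recommendations"], toy_senses))
```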

Correlations between the eight variables are given in Table 4. Note from Table 4 that formality and informativeness are highly correlated in all cases, thereby validating Heylighen and Dewaele’s hypothesis that the purpose of formality (deep formality in particular) is more informative communication. Note, however, that in most cases, there is very little correlation between formality and implicature (small positive/negative values). There are two possible reasons for this: (a) implicature is a poorly-understood phenomenon, and maybe formality and implicature are not as antagonistically related as argued by Heylighen and Dewaele; (b) our implicature annotation by Turkers showed a central tendency bias and poor agreement between two MTurk experiments, so maybe the mean implicature ratings we obtained are not truly reflective of the actual amount of implicature present in a sentence. Validating which of these two (or maybe both) is the correct reason is a part of our future work.

^22 Conjunctions are deixis-neutral. We used CRFTagger (Phan, 2006) to part-of-speech-tag our sentences.

^23 More accurately, it should be called an ambiguity score.

Note further from Table 4 that formality and 
informativeness are positively correlated 
(moderate-to-good correlation) with length of the 
sentence - in words and characters. This corroborates 
the earlier finding by Lahiri et al. (2011) that as a 
piece of text gets more formal, it tends to become 
longer and more intricate. Formality and 
informativeness also correlate positively (moderate 
correlation) with Heylighen and Dewaele’s F-score, 
except in the Forum genre. On the other hand, they do 
not have significant correlations with the 
informativeness (I) score except in the Forum genre. 
Implicature has a significant, but small, negative 
correlation with F-score in all cases. Lexical density 
correlates negatively with length of the sentence 
(#words and #characters). The informativeness (I) 
score correlates positively with length, but 
negatively with Heylighen and Dewaele’s F-score, as 
expected. The two length scores have an almost perfect 
positive correlation with each other, which is 
unsurprising. 
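
A table of pairwise correlations like the one discussed above can be sketched as follows. This is an illustration only: it assumes Pearson's r, since this section does not say which coefficient was used, and the variable names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when a variable is constant
    return cov / math.sqrt(vx * vy)


def correlation_table(columns):
    """Map every ordered pair of variable names to their correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for a in names for b in names}


# Toy per-sentence values (hypothetical, not taken from the corpus).
toy = {
    "n_words": [5, 10, 15, 20],
    "n_chars": [27, 52, 80, 101],
    "formality": [2.0, 3.5, 4.0, 5.5],
}
table = correlation_table(toy)
```

In the toy data the two length measures are almost perfectly correlated, mirroring the relationship between #words and #characters noted above.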

The surprising part, however, is that formality 
and informativeness (as rated by humans) are not 
very highly correlated (either positively or 
negatively) with Heylighen and Dewaele’s F-score or 
our informativeness (I) score. Maybe these two 
scores are measuring complementary aspects of the 
phenomenon of formality, and are not individually 
able to explain all the variations. Automated 
scoring/prediction of formality by modeling it on top 
of scores like these (perhaps as features) is our 
future plan. We would also like to investigate how to 
predict informativeness, and how to get a better 
handle on implicature scoring - both by humans and 
by automated means. 

4.3 Sentential Make-up of Documents 

In our final experiment, we investigated how the 
sentences in a document vary in terms of formality, 
implicature, and informativeness - starting from the 
beginning sentences, then the middle ones, and 
finally the last ones. We divided the sentences into 
ten successive bins (deciles) based on their position 
in the document, and measured the mean formality, 
informativeness, and implicature per decile. The 
results - averaged across all documents in a particular 
genre (blog, forums, news, overall) - are shown in 
Figure 2. Figure 2 also shows the standard errors for 
each decile. 
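
The positional binning described above can be sketched as follows. This is one plausible reading, not necessarily the exact procedure used: it assumes that ratings are first averaged within each document's bin and that the standard error is then taken across documents. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def decile_index(position, n_sentences, n_bins=10):
    """Map a 0-based sentence position to a bin in 0..n_bins-1."""
    return min(n_bins - 1, position * n_bins // n_sentences)


def decile_profile(documents, n_bins=10):
    """Return a (mean, standard error) pair for each positional bin.

    `documents` is a list of documents, each a list of per-sentence mean
    ratings in reading order. Documents shorter than `n_bins` sentences
    leave some bins empty (an assumption of this sketch).
    """
    per_bin = [[] for _ in range(n_bins)]
    for ratings in documents:
        doc_bins = [[] for _ in range(n_bins)]
        for pos, rating in enumerate(ratings):
            doc_bins[decile_index(pos, len(ratings), n_bins)].append(rating)
        for b, values in enumerate(doc_bins):
            if values:
                per_bin[b].append(mean(values))
    profile = []
    for values in per_bin:
        if not values:
            profile.append((None, None))
        elif len(values) == 1:
            profile.append((values[0], None))  # standard error undefined
        else:
            profile.append((mean(values),
                            stdev(values) / math.sqrt(len(values))))
    return profile


# Two toy documents of ten sentences each (hypothetical ratings).
docs = [[7, 6, 5, 4, 3, 2, 1, 1, 1, 1],
        [5, 5, 5, 5, 5, 3, 3, 3, 3, 3]]
profile = decile_profile(docs)
```

Running the same function separately on the documents of each genre, and once on all documents, would give the four curves that the figure reports.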

Note from Figure 2 that news sentences are most 
formal and most informative, followed by blog 
sentences, followed by forum sentences. In terms 
of formality and informativeness trends, news 
sentences start with high formality and 
informativeness, then gradually diminish in both - 
perhaps reflecting the fact that in journalistic 
writing, the first few sentences carry the most 
information (to catch the readers’ attention), and 
the information/interesting-ness content decreases 
substantially thereafter. Forum sentences, on the 
other hand, maintain a low level of formality and 
informativeness throughout - with a few small peaks 
and valleys in-between. For blogs, the trend is first 
decreasing, then increasing, and then decreasing 
again - indicating that the most informative (and 
formal) sentences in blogs may be in the middle. All 
three genres taken together, both formality and 
informativeness show a decreasing trend. There is no 
clear trend in the implicature rating of sentences - 
it is mostly an assortment of peaks and valleys. 

5 Conclusion 

In this paper, we introduced a dataset of 7,032 
sentences rated for formality, informativeness, and 
implicature on a 1-7 scale by human annotators on 
Amazon Mechanical Turk. To the best of our 
knowledge, this is the first large-scale annotation 
effort that ties together all three pragmatic 
variables at the sentence level. We measured 
reliability of our annotations by running two 
independent rounds of annotation on MTurk, and 
inspecting the correlation among mean ratings between 
the two rounds. We further examined correlation of 
our annotations with pilot sentence formality 
annotations done in a more controlled setting (Lahiri 
and Lu, 2011). It was observed that while formality 
and informativeness can be reliably annotated on a 
1-7 scale, implicature poses a much more difficult 
challenge. We analyzed the distribution of formality, 
informativeness, and implicature across three genres 
(news, blogs, and forums), and found significant 
differences - both in terms of overall distribution, 
and also in terms of the documents’ sentential 
make-up. Correlations between the human ratings and 
five other stylistic variables were carefully 
examined. Our future plans include an automatic 
sentence-level formality and informativeness 
predictor, in the same spirit as 
(Danescu-Niculescu-Mizil et al., 2013). We also plan 
to investigate implicature rating more thoroughly, 
and figure out a good way to improve reliability in 
implicature annotation. 

The limitations of our study mostly stem from our 
lack of control on the MTurk experiments. Some 
of that is intentional, because we really wanted 
to observe what people think/feel as formal, 
informative, and implicative. However, previous 
studies have employed measures like background 
questionnaires, linguistic attentiveness surveys, and 
z-scoring to weed out/smooth difficulties 
(Danescu-Niculescu-Mizil et al., 2013). While these 
are indeed promising research directions to try, we 
opine that even without such stringent measures, we 
were able to obtain quite good annotations - except 
implicature, where the earlier approach of Degen 
(2015) may truly be very helpful. 